The aim of this tutorial is to get familiar with the use of decision trees and their generalizations on simple examples using scikit-learn tools.
You will need to install the python-graphviz package first. If needed, uncomment the pip command below:
# If needed, uncomment the line below:
# pip install graphviz
from pylab import *
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_wine
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import graphviz
import pandas as pd
import random
rng_seed = 0  # np.random.seed returns None, so keep the seed value itself for reuse as random_state
np.random.seed(rng_seed)
import warnings
warnings.filterwarnings("ignore")
The data for this tutorial is famous. Called the iris dataset, it contains four variables measuring various parts of iris flowers of three related species, plus a fifth variable with the species name. The reason it is so famous in the machine learning and statistics communities is that the data require very little preprocessing (no missing values, all features are floating-point numbers, etc.).
iris = load_iris()
What is the structure of the object iris ?
Plot this dataset in a well chosen set of representations to explore the data.
iris is an instance of scikit-learn's Bunch class, a dictionary-like container whose elements can also be accessed as attributes. Being a dataset, iris carries a list of such attributes:
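The exact key list varies slightly across scikit-learn versions, but it can be inspected directly; a quick sketch:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# Bunch behaves like a dict whose entries are also attributes
print(sorted(iris.keys()))
print(iris.data.shape)        # the (150, 4) feature matrix
print(list(iris.target_names))
```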
Since iris behaves like a dictionary, we use the pandas library to transform it into a DataFrame and explore the data through a well-chosen set of representations.
pandas to manipulate the data
Pandas is great for manipulating data in a Microsoft-Excel-like way.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
# Add a new column with the species names, this is what we are going to try to predict
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
import seaborn as sns
sns.pairplot(df, hue="species")
<seaborn.axisgrid.PairGrid at 0x7f8f12708d60>
This visualization shows that the species can be separated using the available attributes: setosa stands clearly apart in every projection, while versicolor and virginica overlap somewhat in each pair of attributes, with the petal measurements coming closest to separating all three. Combining several features is therefore the safest option.
corr = df.corr(numeric_only=True)  # exclude the categorical species column (required by pandas >= 2.0)
corr.style.background_gradient(cmap='coolwarm')
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| sepal length (cm) | 1.000000 | -0.117570 | 0.871754 | 0.817941 |
| sepal width (cm) | -0.117570 | 1.000000 | -0.428440 | -0.366126 |
| petal length (cm) | 0.871754 | -0.428440 | 1.000000 | 0.962865 |
| petal width (cm) | 0.817941 | -0.366126 | 0.962865 | 1.000000 |
Some pairs of attributes are highly correlated (petal length and petal width in particular, at about 0.96). We could later try keeping only weakly correlated attributes; we should not see a drop in results.
Create a new column that, for each row, draws a random number between 0 and 1 and sets the cell to True if that value is less than or equal to .75, and False otherwise. This is a quick and dirty way of randomly assigning some rows to the training data and the rest to the test data.
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | is_train |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | True |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | True |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | True |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | True |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | True |
# Create two new dataframes, one with the training rows, one with the test rows
train, test = df[df['is_train']==True], df[df['is_train']==False]
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:',len(test))
Number of observations in the training data: 118
Number of observations in the test data: 32
# Create a list of the feature columns' names
features = df.columns[:4].tolist()
# View features
features
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# train['species'] contains the actual species names. Before we can use it,
# we need to convert each species name into a digit. So, in this case there
# are three species, which have been coded as 0, 1, or 2.
y = pd.factorize(train['species'])[0]
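As a quick sanity check on this encoding (a side note, run on the full unshuffled DataFrame rather than the train split): pd.factorize numbers labels in order of first appearance, which here coincides with scikit-learn's own 0/1/2 coding.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# factorize assigns codes in order of first appearance
codes, uniques = pd.factorize(df['species'])
print(list(uniques))                       # ['setosa', 'versicolor', 'virginica']
print(bool((codes == iris.target).all()))  # True on the unshuffled frame
```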
The tree.DecisionTreeClassifier class from scikit-learn builds decision tree objects as follows:
clf = tree.DecisionTreeClassifier(random_state=rng_seed)
clf = clf.fit(train[features], y)
# Using the whole dataset you may use directly:
#clf = clf.fit(iris.data, iris.target)
The export_graphviz exporter supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. Jupyter notebooks also render these plots inline automatically:
dot_data = tree.export_graphviz(clf, out_file=None,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph
We can also export the tree in Graphviz format and save the resulting graph to an output file iris.pdf:
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("iris")
'iris.pdf'
After being fitted, the model can then be used to predict the class of samples:
class_pred = clf.predict(iris.data[:1, :])
iris.target_names[class_pred[0]]
'setosa'
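The classifier also exposes per-class probability estimates; for a single decision tree these are simply the class fractions in the leaf the sample falls into. A minimal sketch (here fitting on the full dataset for brevity):

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
# One row per sample, one column per class, each row summing to 1
proba = clf.predict_proba(iris.data[:1, :])
print(proba)
```

Note that a fully grown tree memorizes its training set, so on training samples the leaves are pure and the probabilities degenerate to 0/1; on held-out data they are more informative.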
Train the decision tree on the iris dataset and explain how one should read the blocks in the Graphviz representation of the tree.
Plot the decision regions with the points of the training set superimposed.
Indication: you may find the function plt.contourf useful.
Answer 1: To train the decision tree on the Iris dataset and plot it with Graphviz, we reuse the code from above:
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
y = pd.factorize(train['species'])[0]
clf = tree.DecisionTreeClassifier(random_state=rng_seed)
clf = clf.fit(train[features], y)
dot_data = tree.export_graphviz(clf, out_file=None,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
y_pred_train = clf.predict(train[features])
Explanation: A decision tree consists of nodes (intermediate boxes) and leaves (boxes at the end of a branch). Each node carries the information for one decision: samples satisfying the node's test follow the arrow labeled True to one child, the others follow the arrow labeled False to the other child. Each box also reports the split test, the impurity (gini), the number of samples reaching the node, and their class distribution (value). Following the arrows eventually leads to a leaf, which carries the tree's final decision.
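The same node information (split test, sample counts, majority class) can also be read without Graphviz via tree.export_text; a minimal sketch on the full dataset:

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
# Indented lines are deeper nodes; leaves end with "class: ..."
print(tree.export_text(clf, feature_names=list(iris.feature_names)))
```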
Answer 2: We draw the decision boundaries (for each pair of attributes) of decision trees trained as above. The code is adapted from the scikit-learn documentation.
from sklearn.inspection import DecisionBoundaryDisplay
from itertools import combinations
attr = list(combinations(df.columns[:-2], 2))
attr = list(map(list, attr)) # List of all pairs of attributes in the iris dataset
for i, pair in enumerate(attr):
    X = df[pair].to_numpy()
    y = iris.target
    clf_attr = tree.DecisionTreeClassifier(random_state=rng_seed).fit(X, y)
    ax = plt.subplot(2, 3, i+1)
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
    # Decision boundaries
    DecisionBoundaryDisplay.from_estimator(clf_attr, X, cmap="brg", response_method="predict", ax=ax,
                                           xlabel=pair[0],
                                           ylabel=pair[1])
    # Data points
    for t, color in zip(range(3), "brg"):
        idx = np.where(t == y)
        plt.scatter(X[idx, 0],
                    X[idx, 1],
                    c=color,
                    cmap="brg",
                    edgecolors="black",
                    s=20)
plt.suptitle("Pairwise decision boundaries from a decision tree");
As we can see, the decision boundaries are not linear but made of axis-aligned segments, a consequence of using a decision tree.
Build 2 different trees based on the sepal features (sepal length, sepal width) vs the petal features (petal length, petal width) only: which features are the most discriminant?
Compare performances with those obtained using all features.
Try the same as above using the various splitting criteria available (Gini index, classification error, cross-entropy). Comment on your results.
Answer 1: We build two decision trees, one based on the sepal attributes and one based on the petal attributes.
y = pd.factorize(train['species'])[0]
y_true = pd.factorize(test['species'])[0]
features_sepal = ["sepal length (cm)", "sepal width (cm)"]
features_petal = ["petal length (cm)", "petal width (cm)"]
X_sepal = train[features_sepal]
clf_sepal = tree.DecisionTreeClassifier(random_state=rng_seed).fit(X_sepal, y)
X_petal = train[features_petal]
clf_petal = tree.DecisionTreeClassifier(random_state=rng_seed).fit(X_petal, y)
y_pred_sepal = clf_sepal.predict(test[features_sepal])
y_pred_petal = clf_petal.predict(test[features_petal])
print(f"ACC (sepal) : {accuracy_score(y_true, y_pred_sepal)}")
print(f"ACC (petal) : {accuracy_score(y_true, y_pred_petal)}")
ACC (sepal) : 0.717948717948718
ACC (petal) : 0.9743589743589743
from sklearn.metrics import confusion_matrix
confusion_matrix_sepal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_sepal, iris.target_names), labels=iris.target_names)
confusion_matrix_petal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_petal, iris.target_names), labels=iris.target_names)
fig, ax = plt.subplots(1,2, figsize=(12,8))
sns.heatmap(confusion_matrix_sepal, ax=ax[0], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_petal, ax=ax[1], annot=True, cmap="YlGnBu")
ax[0].set_title("Confusion Matrix for Sepal")
ax[0].set_xlabel("Predictions values")
ax[0].set_ylabel("Actual values")
ax[0].xaxis.set_ticklabels(iris.target_names.tolist())
ax[0].yaxis.set_ticklabels(iris.target_names.tolist())
ax[1].set_title("Confusion Matrix for Petal")
ax[1].set_xlabel("Predictions values")
ax[1].set_ylabel("Actual values")
ax[1].xaxis.set_ticklabels(iris.target_names.tolist())
ax[1].yaxis.set_ticklabels(iris.target_names.tolist())
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]
We assessed the performance of the two decision trees using the accuracy metric and the confusion matrix.
Answer 2: Let us compare these performances with those of the decision tree that keeps all attributes.
X = train.iloc[:, 0:4]
clf_all = tree.DecisionTreeClassifier(random_state=rng_seed).fit(X, y)
y_pred_all = clf_all.predict(test.iloc[:, 0:4])
print(f"ACC (all) : {accuracy_score(y_true, y_pred_all)}")
confusion_matrix_all = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_all, iris.target_names), labels=iris.target_names)
ax = plt.subplot(1,1,1)
sns.heatmap(confusion_matrix_all, ax=ax, annot=True, cmap="YlGnBu")
ax.set_title("Confusion Matrix for all")
ax.set_xlabel("Predictions values")
ax.set_ylabel("Actual values")
ax.xaxis.set_ticklabels(iris.target_names.tolist())
ax.yaxis.set_ticklabels(iris.target_names.tolist())
ACC (all) : 0.9230769230769231
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]
We observe that, using the whole feature set, the predictions are excellent, on par with the decision tree based on the petal attributes alone.
Answer 3: Let us now change the splitting criterion and compare performance.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
y = pd.factorize(train['species'])[0]
y_true = pd.factorize(test['species'])[0]
features_sepal = ["sepal length (cm)", "sepal width (cm)"]
features_petal = ["petal length (cm)", "petal width (cm)"]
X_sepal = train[features_sepal]
clf_sepal = tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed).fit(X_sepal, y)
X_petal = train[features_petal]
clf_petal = tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed).fit(X_petal, y)
X = train.iloc[:, 0:4]
clf_all = tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed).fit(X, y)
y_pred_sepal = clf_sepal.predict(test[features_sepal])
y_pred_petal = clf_petal.predict(test[features_petal])
y_pred_all = clf_all.predict(test.iloc[:, 0:4])
print(f"ACC (sepal) : {accuracy_score(y_true, y_pred_sepal)}")
print(f"ACC (petal) : {accuracy_score(y_true, y_pred_petal)}")
print(f"ACC (all) : {accuracy_score(y_true, y_pred_all)}")
confusion_matrix_sepal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_sepal, iris.target_names), labels=iris.target_names)
confusion_matrix_petal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_petal, iris.target_names), labels=iris.target_names)
confusion_matrix_all = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_all, iris.target_names), labels=iris.target_names)
fig, ax = plt.subplots(1,3, figsize=(12,8))
sns.heatmap(confusion_matrix_sepal, ax=ax[0], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_petal, ax=ax[1], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_all, ax=ax[2], annot=True, cmap="YlGnBu")
ax[0].set_title("Confusion Matrix for Sepal")
ax[0].set_xlabel("Predictions values")
ax[0].set_ylabel("Actual values")
ax[0].xaxis.set_ticklabels(iris.target_names.tolist())
ax[0].yaxis.set_ticklabels(iris.target_names.tolist())
ax[1].set_title("Confusion Matrix for Petal")
ax[1].set_xlabel("Predictions values")
ax[1].set_ylabel("Actual values")
ax[1].xaxis.set_ticklabels(iris.target_names.tolist())
ax[1].yaxis.set_ticklabels(iris.target_names.tolist())
ax[2].set_title("Confusion Matrix for all")
ax[2].set_xlabel("Predictions values")
ax[2].set_ylabel("Actual values")
ax[2].xaxis.set_ticklabels(iris.target_names.tolist())
ax[2].yaxis.set_ticklabels(iris.target_names.tolist())
ACC (sepal) : 0.7631578947368421
ACC (petal) : 0.9210526315789473
ACC (all) : 0.9210526315789473
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
y = pd.factorize(train['species'])[0]
y_true = pd.factorize(test['species'])[0]
features_sepal = ["sepal length (cm)", "sepal width (cm)"]
features_petal = ["petal length (cm)", "petal width (cm)"]
X_sepal = train[features_sepal]
clf_sepal = tree.DecisionTreeClassifier(criterion="log_loss", random_state=rng_seed).fit(X_sepal, y)
X_petal = train[features_petal]
clf_petal = tree.DecisionTreeClassifier(criterion="log_loss", random_state=rng_seed).fit(X_petal, y)
X = train.iloc[:, 0:4]
clf_all = tree.DecisionTreeClassifier(criterion="log_loss", random_state=rng_seed).fit(X, y)
y_pred_sepal = clf_sepal.predict(test[features_sepal])
y_pred_petal = clf_petal.predict(test[features_petal])
y_pred_all = clf_all.predict(test.iloc[:, 0:4])
print(f"ACC (sepal) : {accuracy_score(y_true, y_pred_sepal)}")
print(f"ACC (petal) : {accuracy_score(y_true, y_pred_petal)}")
print(f"ACC (all) : {accuracy_score(y_true, y_pred_all)}")
confusion_matrix_sepal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_sepal, iris.target_names), labels=iris.target_names)
confusion_matrix_petal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_petal, iris.target_names), labels=iris.target_names)
confusion_matrix_all = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_all, iris.target_names), labels=iris.target_names)
fig, ax = plt.subplots(1,3, figsize=(12,8))
sns.heatmap(confusion_matrix_sepal, ax=ax[0], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_petal, ax=ax[1], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_all, ax=ax[2], annot=True, cmap="YlGnBu")
ax[0].set_title("Confusion Matrix for Sepal")
ax[0].set_xlabel("Predictions values")
ax[0].set_ylabel("Actual values")
ax[0].xaxis.set_ticklabels(iris.target_names.tolist())
ax[0].yaxis.set_ticklabels(iris.target_names.tolist())
ax[1].set_title("Confusion Matrix for Petal")
ax[1].set_xlabel("Predictions values")
ax[1].set_ylabel("Actual values")
ax[1].xaxis.set_ticklabels(iris.target_names.tolist())
ax[1].yaxis.set_ticklabels(iris.target_names.tolist())
ax[2].set_title("Confusion Matrix for all")
ax[2].set_xlabel("Predictions values")
ax[2].set_ylabel("Actual values")
ax[2].xaxis.set_ticklabels(iris.target_names.tolist())
ax[2].yaxis.set_ticklabels(iris.target_names.tolist())
ACC (sepal) : 0.7142857142857143
ACC (petal) : 0.9428571428571428
ACC (all) : 0.9428571428571428
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]
features_sepal = ["sepal length (cm)", "sepal width (cm)"]
features_petal = ["petal length (cm)", "petal width (cm)"]
X_sepal = train[features_sepal]
clf_sepal = tree.DecisionTreeClassifier(criterion="gini", random_state=rng_seed).fit(X_sepal, y)
X_petal = train[features_petal]
clf_petal = tree.DecisionTreeClassifier(criterion="gini", random_state=rng_seed).fit(X_petal, y)
X = train.iloc[:, 0:4]
clf_all = tree.DecisionTreeClassifier(criterion="gini", random_state=rng_seed).fit(X, y)
y_pred_sepal = clf_sepal.predict(test[features_sepal])
y_pred_petal = clf_petal.predict(test[features_petal])
y_pred_all = clf_all.predict(test.iloc[:, 0:4])
print(f"ACC (sepal) : {accuracy_score(y_true, y_pred_sepal)}")
print(f"ACC (petal) : {accuracy_score(y_true, y_pred_petal)}")
print(f"ACC (all) : {accuracy_score(y_true, y_pred_all)}")
confusion_matrix_sepal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_sepal, iris.target_names), labels=iris.target_names)
confusion_matrix_petal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_petal, iris.target_names), labels=iris.target_names)
confusion_matrix_all = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_all, iris.target_names), labels=iris.target_names)
fig, ax = plt.subplots(1,3, figsize=(12,8))
sns.heatmap(confusion_matrix_sepal, ax=ax[0], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_petal, ax=ax[1], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_all, ax=ax[2], annot=True, cmap="YlGnBu")
ax[0].set_title("Confusion Matrix for Sepal")
ax[0].set_xlabel("Predictions values")
ax[0].set_ylabel("Actual values")
ax[0].xaxis.set_ticklabels(iris.target_names.tolist())
ax[0].yaxis.set_ticklabels(iris.target_names.tolist())
ax[1].set_title("Confusion Matrix for Petal")
ax[1].set_xlabel("Predictions values")
ax[1].set_ylabel("Actual values")
ax[1].xaxis.set_ticklabels(iris.target_names.tolist())
ax[1].yaxis.set_ticklabels(iris.target_names.tolist())
ax[2].set_title("Confusion Matrix for all")
ax[2].set_xlabel("Predictions values")
ax[2].set_ylabel("Actual values")
ax[2].xaxis.set_ticklabels(iris.target_names.tolist())
ax[2].yaxis.set_ticklabels(iris.target_names.tolist())
ACC (sepal) : 0.7631578947368421
ACC (petal) : 0.9210526315789473
ACC (all) : 0.9473684210526315
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]
We summarize everything in a table:
from IPython import display
display.Image("Summary_table.png")
We almost always observe similar accuracy between the dataset reduced to the petal attributes and the full dataset (the sepal-only trees lag behind). With more data and a more complex problem we might see the advantages and drawbacks of each splitting criterion, but that is not the case here.
Try the same approach adapted to another toy dataset from scikit-learn described at:
http://scikit-learn.org/stable/datasets/index.html
Play with another dataset available at: http://archive.ics.uci.edu/ml/datasets.html
wine = load_wine(as_frame=True)
data_full = wine.data.copy()
data_full['classes'] = wine.target.copy()
X = wine.data.copy()
y = wine.target.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=rng_seed)
sns.pairplot(data_full, hue="classes")
<seaborn.axisgrid.PairGrid at 0x7f8e358b7d00>
This time it is harder to see a trivial classification from the pairwise attribute plots. However, a few attributes separate the classes fairly well (alcohol, malic_acid, color_intensity, ...); these are the attributes to prioritize.
corr = X.corr()
corr.style.background_gradient(cmap='coolwarm')
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| alcohol | 1.000000 | 0.094397 | 0.211545 | -0.310235 | 0.270798 | 0.289101 | 0.236815 | -0.155929 | 0.136698 | 0.546364 | -0.071747 | 0.072343 | 0.643720 |
| malic_acid | 0.094397 | 1.000000 | 0.164045 | 0.288500 | -0.054575 | -0.335167 | -0.411007 | 0.292977 | -0.220746 | 0.248985 | -0.561296 | -0.368710 | -0.192011 |
| ash | 0.211545 | 0.164045 | 1.000000 | 0.443367 | 0.286587 | 0.128980 | 0.115077 | 0.186230 | 0.009652 | 0.258887 | -0.074667 | 0.003911 | 0.223626 |
| alcalinity_of_ash | -0.310235 | 0.288500 | 0.443367 | 1.000000 | -0.083333 | -0.321113 | -0.351370 | 0.361922 | -0.197327 | 0.018732 | -0.273955 | -0.276769 | -0.440597 |
| magnesium | 0.270798 | -0.054575 | 0.286587 | -0.083333 | 1.000000 | 0.214401 | 0.195784 | -0.256294 | 0.236441 | 0.199950 | 0.055398 | 0.066004 | 0.393351 |
| total_phenols | 0.289101 | -0.335167 | 0.128980 | -0.321113 | 0.214401 | 1.000000 | 0.864564 | -0.449935 | 0.612413 | -0.055136 | 0.433681 | 0.699949 | 0.498115 |
| flavanoids | 0.236815 | -0.411007 | 0.115077 | -0.351370 | 0.195784 | 0.864564 | 1.000000 | -0.537900 | 0.652692 | -0.172379 | 0.543479 | 0.787194 | 0.494193 |
| nonflavanoid_phenols | -0.155929 | 0.292977 | 0.186230 | 0.361922 | -0.256294 | -0.449935 | -0.537900 | 1.000000 | -0.365845 | 0.139057 | -0.262640 | -0.503270 | -0.311385 |
| proanthocyanins | 0.136698 | -0.220746 | 0.009652 | -0.197327 | 0.236441 | 0.612413 | 0.652692 | -0.365845 | 1.000000 | -0.025250 | 0.295544 | 0.519067 | 0.330417 |
| color_intensity | 0.546364 | 0.248985 | 0.258887 | 0.018732 | 0.199950 | -0.055136 | -0.172379 | 0.139057 | -0.025250 | 1.000000 | -0.521813 | -0.428815 | 0.316100 |
| hue | -0.071747 | -0.561296 | -0.074667 | -0.273955 | 0.055398 | 0.433681 | 0.543479 | -0.262640 | 0.295544 | -0.521813 | 1.000000 | 0.565468 | 0.236183 |
| od280/od315_of_diluted_wines | 0.072343 | -0.368710 | 0.003911 | -0.276769 | 0.066004 | 0.699949 | 0.787194 | -0.503270 | 0.519067 | -0.428815 | 0.565468 | 1.000000 | 0.312761 |
| proline | 0.643720 | -0.192011 | 0.223626 | -0.440597 | 0.393351 | 0.498115 | 0.494193 | -0.311385 | 0.330417 | 0.316100 | 0.236183 | 0.312761 | 1.000000 |
The features are much less correlated than in the iris dataset, so we can use the dataset in its entirety.
models = [tree.DecisionTreeClassifier(random_state=rng_seed),
tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed),
tree.DecisionTreeClassifier(criterion="log_loss", random_state=rng_seed),
RandomForestClassifier(n_estimators=30, max_depth=4, random_state=rng_seed)]
models_name = ["Decision Tree Gini", "Decision Tree Entropy", "Decision Tree Logloss", "Random Forest"]
colors = ["orange", "red", "blue", "green"]
acc = []
fig, ax = plt.subplots(2,2, figsize=(16, 8))
for i, model in enumerate(models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc.append(accuracy_score(y_test, y_pred))
    # Use the wine labels here, not the iris ones
    cm = confusion_matrix(np.choose(y_test, wine.target_names), np.choose(y_pred, wine.target_names), labels=wine.target_names)
    sns.heatmap(cm, ax=ax[i//2][i%2], annot=True, cmap="YlGnBu")
    ax[i//2][i%2].set_title(f"CM for {models_name[i]}")
    ax[i//2][i%2].xaxis.set_ticklabels(wine.target_names.tolist())
    ax[i//2][i%2].yaxis.set_ticklabels(wine.target_names.tolist())
fig, ax = plt.subplots(1, 1, figsize=(16, 8))
for i, model in enumerate(models):
    plt.bar(x=np.arange(4)[i], height=acc[i], label=models_name[i])
plt.title("Accuracy score for each model")
plt.ylabel("ACC")
plt.legend()
Interpretation: Once again, the Random Forest is more effective thanks to the variance reduction obtained by averaging several decision trees. The Gini, entropy, and log-loss criteria make little difference compared with the much higher performance of the Random Forest. Note also that models as simple as decision trees already reach more than 80% accuracy, which should encourage any reader not to overlook them when tackling a supervised classification or regression problem.
Go to
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
for a documentation about the RandomForestClassifier provided by scikit-learn.
Since target values must be integers, we first need to transform labels into numbers as below.
# train['species'] contains the actual species names. Before we can use it,
# we need to convert each species name into a digit. So, in this case there
# are three species, which have been coded as 0, 1, or 2.
y = pd.factorize(train['species'])[0]
# View target
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2])
# Create a random forest Classifier. By convention, clf means 'Classifier'
rf = RandomForestClassifier(n_jobs=2, random_state=0)
# Train the Classifier to take the training features and learn how they relate
# to the training y (the species)
rf.fit(train[features], y)
RandomForestClassifier(n_jobs=2, random_state=0)
Make predictions and map each predicted class back to the actual English species name:
preds = rf.predict(test[features])
preds_names = pd.Categorical.from_codes(preds, iris.target_names)
preds_names
['setosa', 'setosa', 'setosa', 'setosa', 'setosa', ..., 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'] Length: 38 Categories (3, object): ['setosa', 'versicolor', 'virginica']
cm = confusion_matrix(test['species'], preds_names, labels=iris.target_names)
ax = plt.subplot(1,1,1)
sns.heatmap(cm, ax=ax, annot=True, cmap="YlGnBu")
ax.xaxis.set_ticklabels(iris.target_names.tolist())
ax.yaxis.set_ticklabels(iris.target_names.tolist())
plt.title('Confusion matrix Random Forest')
plt.show()
One of the interesting use cases for random forest is feature selection. One of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.
When a certain tree uses one variable and another doesn't, you can compare the value lost or gained from the inclusion/exclusion of that variable. The good random forest implementations are going to do that for you, so all you need to do is know which method or variable to look at.
While we don't get regression coefficients as with ordinary least squares (OLS), we do get a score telling us how important each feature was for classification. This is one of the most powerful aspects of random forests, because we can clearly see that the petal attributes were more important for classification than the sepal attributes.
# View a list of the features and their importance scores
list(zip(rf.feature_names_in_, rf.feature_importances_))
[('sepal length (cm)', 0.07878326083724368),
('sepal width (cm)', 0.0268273291594686),
('petal length (cm)', 0.40142608683728426),
('petal width (cm)', 0.49296332316600344)]
Comment on the feature importances with respect to your previous observations on decision trees above.
Extract and visualize 5 trees belonging to the random forest using the attribute estimators_ of the trained random forest classifier. Compare them. Note that you may code a loop on extracted trees.
Study the influence of parameters like max_depth, min_samples_leaf and min_samples_split. Try to optimize them and explain your approach and choices.
How is the prediction error of a random forest estimated?
Indication: have a look at the oob_score parameter.
What are out-of-bag samples?
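As a hint for the last two questions: each tree in the forest is fitted on a bootstrap sample, which leaves out roughly one third of the rows; those left-out rows are that tree's out-of-bag (OOB) samples. With oob_score=True, each row is scored using only the trees that never saw it, giving an almost-free estimate of the generalization error. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(iris.data, iris.target)
# Out-of-bag estimate of the accuracy, no held-out test set needed
print(rf.oob_score_)
```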
Answer 1: In light of the previous results, we see that the petal attributes are far more important for classification than the sepal attributes. This is fully consistent with the earlier results showing that a decision tree using only the petal attributes is already very effective.
Answer 2: We extract 5 decision trees from the Random Forest and save them under rf_results.
random.seed(10)
trees = random.sample(rf.estimators_, k=5)
for i, one_tree in enumerate(trees):
    dot_data = tree.export_graphviz(one_tree, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names,
                                    filled=True, rounded=True,
                                    special_characters=True)
    graph = graphviz.Source(dot_data)
    graph.render(f"rf_results/tree_{i}")
    y_pred = one_tree.predict(test.iloc[:, 0:4])
    print(f"ACC for tree_{i} = {accuracy_score(y_true, y_pred)}")
ACC for tree_0 = 0.9473684210526315
ACC for tree_1 = 0.9210526315789473
ACC for tree_2 = 0.9473684210526315
ACC for tree_3 = 0.9473684210526315
ACC for tree_4 = 0.9210526315789473
In particular, let us compute a form of score for each tree, as a normalized sum of the importance scores computed earlier:
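The cell computing this score is missing from the notebook; the sketch below is our own reconstruction (one plausible reading of "normalized sum of importance scores", not necessarily the original formula): each tree's feature_importances_ are summed against the forest-level importances, then normalized across the sampled trees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_jobs=2, random_state=0).fit(iris.data, iris.target)
trees = rf.estimators_[:5]  # stand-in for the 5 sampled trees above

# Project each tree's importances onto the forest-level importances,
# then normalize so the 5 scores sum to 1
raw = np.array([t.feature_importances_ @ rf.feature_importances_ for t in trees])
scores = raw / raw.sum()
print(np.round(scores, 3))
```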
So in terms of "importance", the trees with the lowest importance scores are the ones with the lower accuracy (trees 1 and 4 in the output above).
Answer 3: Let us play with the hyperparameters.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
y = pd.factorize(train['species'])[0]
y_true = pd.factorize(test['species'])[0]
DEPTH = np.linspace(2,8,7).astype(int)
depths, accs = [], []  # avoid shadowing the built-ins abs and ord
for max_depth in DEPTH:
    rf = RandomForestClassifier(n_jobs=2, max_depth=max_depth, random_state=rng_seed)
    rf.fit(train[features], y)
    y_pred = rf.predict(test[features])
    depths.append(max_depth)
    accs.append(accuracy_score(y_true, y_pred))
plt.plot(depths, accs)
plt.title("Accuracy as a function of max_depth")
plt.xlabel("max_depth")
plt.ylabel("Accuracy score")
Text(0, 0.5, 'Accuracy score')
LEAF = np.linspace(1,20,20).astype(int)
leaves, accs = [], []
for min_samples_leaf in LEAF:
    rf = RandomForestClassifier(n_jobs=2, min_samples_leaf=min_samples_leaf, random_state=rng_seed)
    rf.fit(train[features], y)
    y_pred = rf.predict(test[features])
    leaves.append(min_samples_leaf)
    accs.append(accuracy_score(y_true, y_pred))
plt.plot(leaves, accs)
plt.title("Accuracy as a function of min_samples_leaf")
plt.xlabel("min_samples_leaf")
plt.ylabel("Accuracy score")
SPLIT = np.linspace(2, 40, 39).astype(int)
splits, accs = [], []
for min_samples_split in SPLIT:
    rf = RandomForestClassifier(n_jobs=2, min_samples_split=min_samples_split, random_state=rng_seed)
    rf.fit(train[features], y)
    y_pred = rf.predict(test[features])
    splits.append(min_samples_split)
    accs.append(accuracy_score(y_true, y_pred))
plt.plot(splits, accs)
plt.title("Accuracy as a function of min_samples_split")
plt.xlabel("min_samples_split")
plt.ylabel("Accuracy score")
After examining the effect of each hyperparameter individually, we now look at the effect of every combination of them. This exhaustive search takes longer (about 5 minutes) but should yield an optimal result.
from tqdm import tqdm
DEPTH = np.linspace(2, 8, 7).astype(int)
LEAF = np.linspace(1, 8, 8).astype(int)
SPLIT = np.linspace(2, 40, 40).astype(int)
scores = [0]
for max_depth in tqdm(DEPTH):
    for min_samples_leaf in LEAF:
        for min_samples_split in SPLIT:
            rf = RandomForestClassifier(n_jobs=2, max_depth=max_depth,
                                        min_samples_leaf=min_samples_leaf,
                                        min_samples_split=min_samples_split,
                                        random_state=rng_seed)
            rf.fit(train[features], y)
            y_pred = rf.predict(test[features])
            acc = accuracy_score(y_true, y_pred)  # compute once, reuse below
            if acc > max(scores):
                best_combination = (max_depth, min_samples_leaf, min_samples_split)
                print(best_combination)
                print(acc)
                scores.append(acc)
(2, 1, 2) 0.9705882352941176
(5, 8, 35) 1.0
100%|██████████| 7/7 [06:19<00:00, 54.19s/it]
print("The best combination is:")
print(f"- max_depth = {best_combination[0]}")
print(f"- min_samples_leaf = {best_combination[1]}")
print(f"- min_samples_split = {best_combination[2]}")
The best combination is: - max_depth = 5 - min_samples_leaf = 8 - min_samples_split = 35
We reach an accuracy of 1 on the generated test set.
This combination is consistent with the individual effects of the hyperparameters plotted above. If the optimal depth (given the other hyperparameters) seems low, remember that a random forest draws its strength from the committee-of-experts principle: it can obtain very good results from a vote over an ensemble of very simple trees! That said, it remains difficult to draw strong conclusions given the simplicity of the dataset and of the classification task.
The technique we used is very costly and is not feasible in a reasonable time on a very large dataset. The simplest approach is then to optimize the parameters one by one, starting with min_samples_leaf and min_samples_split, which, according to the paper "An empirical study on hyperparameter tuning of decision trees" by Rafael Gomes Mantovani, Tomas Horvath, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren and André Carlos Ponce de Leon Ferreira de Carvalho, are the hyperparameters with the greatest influence.
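Note that scikit-learn also ships a standard tool for this kind of exhaustive search, `GridSearchCV`, which cross-validates each combination instead of reusing a single test split. A minimal sketch (the grid values below are illustrative, not the ones used above):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# A small illustrative grid; the real search above used finer ranges.
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 4, 8],
    "min_samples_split": [2, 16, 32],
}
search = GridSearchCV(RandomForestClassifier(n_estimators=30, random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

For larger grids, `RandomizedSearchCV` samples combinations instead of enumerating them all, which keeps the cost bounded.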
Answer 4: To estimate the prediction error we use the out-of-bag (OOB) error. The RandomForestClassifier is trained using bootstrap aggregation, where each new tree is fit on a bootstrap sample of the training observations $z_i = (x_i, y_i)$. The OOB error is the average error over the $z_i$, where each prediction uses only the trees that do not contain $z_i$ in their respective bootstrap sample (the OOB sample). This makes it possible to fit and validate the RandomForestClassifier at the same time.
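In scikit-learn this is a single constructor flag, `oob_score=True`; a short sketch on iris:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# oob_score=True evaluates each sample using only the trees that did not
# see it during bootstrap sampling; no separate validation set is needed.
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=0).fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")  # OOB error = 1 - rf.oob_score_
```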
Answer 5: When the dataset has classes in unequal proportions (as can be the case in fraud detection, for example), the algorithm receives many more examples of one class, which pushes it to favour that class. It does not learn what makes the other class "different" and fails to capture the underlying patterns that distinguish the classes.
To mitigate this, one can:
- resample the data (oversample the minority class or undersample the majority class);
- reweight the classes, e.g. via the class_weight parameter of RandomForestClassifier;
- evaluate with metrics suited to imbalance (precision, recall, F1) rather than raw accuracy.
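A minimal sketch of the reweighting option, on a toy imbalanced dataset built with `make_classification` (an assumption here, standing in for the fraud-style data mentioned above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 95% / 5% class imbalance, as a stand-in for a fraud-detection setting.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=0)
# "balanced" reweights samples inversely to class frequency, so errors on
# the rare class cost as much as errors on the frequent one.
rf = RandomForestClassifier(class_weight="balanced", random_state=0)
rf.fit(X, y)
```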
Random forest is capable of learning without carefully crafted data transformations. Take the $f(x) = \sin(x)$ function for example.
Create some fake data and add a little noise.
x = np.random.uniform(-2.5, 2.5, 1000)
y = np.sin(x) + np.random.normal(0, .1, 1000)
plt.plot(x,y,'ko',markersize=1,label='data')
plt.plot(np.arange(-2.5,2.5,0.1),np.sin(np.arange(-2.5,2.5,0.1)),'r-',label='ref')
plt.show()
If we try and build a basic linear model to predict y using x we end up with a straight line that sort of bisects the sin(x) function. Whereas if we use a random forest, it does a much better job of approximating the sin(x) curve and we get something that looks much more like the true function.
Based on this example, we will illustrate how the random forest isn't bound by linear constraints.
Note that ordinary least squares regression is available via: from sklearn.linear_model import LinearRegression
You may use half of the points for training and the others to test predictions. This will give you an idea of how closely the random forest predictor fits the sine curve.
To this end, use the RandomForestRegressor model. Be careful: when only one feature x is used as input, you need to reshape it with x.reshape(-1,1) before calling fit and predict.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=rng_seed)
One clever way to compare models when using scikit-learn is to loop over the models as follows:
models = [LinearRegression(fit_intercept=True),
tree.DecisionTreeRegressor(random_state=rng_seed),
RandomForestRegressor(n_estimators=30, max_depth=4, random_state=rng_seed)]
models_name = ["Linear Regression", "Decision Tree", "Random Forest"]
colors = ["orange", "blue", "green"]
mae = []
fig, ax = plt.subplots(1,2, figsize=(16,8))
ax[0].plot(np.arange(-2.5,2.5,0.1),np.sin(np.arange(-2.5,2.5,0.1)),'r-',label='Theoretical curve')
for i, model in enumerate(models):
model.fit(X_train.reshape(-1,1), y_train)
y_pred = model.predict(X_test.reshape(-1,1))
mae.append(mean_absolute_error(y_test, y_pred))
if i == 0:
ax[0].plot(X_test, y_pred, label=models_name[i], color=colors[i])
else:
ax[0].scatter(X_test, y_pred, label=models_name[i], color=colors[i], edgecolors="black", s=20)
ax[0].legend()
ax[0].set_xlabel("x")
ax[0].set_ylabel("y")
ax[0].set_title("Approximation of the sinus function")
ax[1].bar(x=np.arange(3)[i], height=mae[i], label=models_name[i])
ax[1].set_title("Mean Absolute Error for each model")
ax[1].set_ylabel("MAE")
ax[1].legend()
Interpretation: the sine function is non-linear, which explains why OLS gives a rough approximation, since it attempts a linear fit. Unlike OLS, decision trees make no linearity assumption about the data and can therefore approximate non-linear problems correctly. We also see that committee-of-experts methods, of which Random Forests are an example, reduce the variance of the predictions: on the scatter plot the green points are clearly less dispersed around the curve. This yields a better approximation, as confirmed by the mean absolute error values.
http://scikit-learn.org/stable/modules/tree.html
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Since post-pruning of trees is not implemented in scikit-learn, you may think of coding your own pruning function, for instance taking into account the number of samples per leaf as proposed below:
# Pruning function: turn nodes with few samples into leaves
def prune(decisiontree, min_samples_leaf=1):
    if decisiontree.min_samples_leaf >= min_samples_leaf:
        raise ValueError('Tree is already at least this pruned')
    decisiontree.min_samples_leaf = min_samples_leaf
    t = decisiontree.tree_  # local name to avoid shadowing the tree module
    for i in range(t.node_count):
        if t.n_node_samples[i] <= min_samples_leaf:
            # Setting both children to -1 marks node i as a leaf
            t.children_left[i] = -1
            t.children_right[i] = -1
Let's revisit the wine dataset to see the impact of post-pruning on a tree. Post-pruning is applied to a single tree to avoid overfitting. Since a random forest uses bootstrapping with many weakly correlated trees, post-pruning is not necessary there.
The tree with the best results was the one using the entropy criterion, so we reuse it.
wine = load_wine(as_frame=True)
data_full = wine.data.copy()
data_full['classes'] = wine.target.copy()
X = wine.data.copy()
y = wine.target.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=rng_seed)
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed)
clf.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy')
dot_data = tree.export_graphviz(clf, out_file=None,
feature_names=wine.feature_names,
class_names=wine.target_names,
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph
Our tree does not have enough nodes and leaves to benefit from post-pruning.